Section - 003

 library(dataReporter)
 library(pointblank)
 library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()    masks stats::filter()
## x dplyr::lag()       masks stats::lag()
## x dplyr::summarize() masks dataReporter::summarize()
 library(here)
## here() starts at /Users/preet/Desktop/Fourth Sem/DAB-402/Sanket_Project/Assessment-2
 library(stringr) 
 library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

‘image_details’ is a dataframe we created, consisting of the names of all extracted images and their respective dimensions. We’ll have a look at it to understand the image dataset we have at hand.

# Loading the image_details dataframe
 
 image_details <- read_csv('image_details.csv')
## Warning: Missing column names filled in: 'X1' [1]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   image_name = col_character(),
##   width = col_double(),
##   height = col_double()
## )
# Let's have a look at its content
    head(image_details)
## # A tibble: 6 × 4
##      X1 image_name                        width height
##   <dbl> <chr>                             <dbl>  <dbl>
## 1     0 100_litfl_other_convex_frame0.jpg   480    360
## 2     1 100_litfl_other_convex_frame1.jpg   480    360
## 3     2 100_litfl_other_convex_frame2.jpg   480    360
## 4     3 100_litfl_other_convex_frame3.jpg   480    360
## 5     4 100_litfl_other_convex_frame4.jpg   480    360
## 6     5 100_litfl_other_convex_frame5.jpg   480    360
  nrow(image_details)
## [1] 18628

The above output shows that we have extracted over 18K images using the available data-scraping script. However, we’d like to spend some more time extracting additional images for the project, since model performance depends directly on the amount of data we get to train the model on.

The ‘image_name’ column contains significant information about each image: the source it was extracted from, the probe type and the class it belongs to. It’d be useful to pull out these details to get a better picture of the image dataset.

# Splitting the image_name column by delimiter '_'

  image_details[c('X', 'Source', 'Class', 'Probe', 'X1')] <- str_split_fixed(image_details$image_name, '_', 5)
 
  # Rearrange columns and drop the original name and index columns
  image_details <- image_details[c('X1', 'Source', 'Class', 'Probe', 'width', 'height')]
 
  # Renaming the first column
  
 names(image_details)[names(image_details) == "X1"] <- "image"
head(image_details)
## # A tibble: 6 × 6
##   image      Source Class Probe  width height
##   <chr>      <chr>  <chr> <chr>  <dbl>  <dbl>
## 1 frame0.jpg litfl  other convex   480    360
## 2 frame1.jpg litfl  other convex   480    360
## 3 frame2.jpg litfl  other convex   480    360
## 4 frame3.jpg litfl  other convex   480    360
## 5 frame4.jpg litfl  other convex   480    360
## 6 frame5.jpg litfl  other convex   480    360

This dataframe can be viewed as the image metadata: information about the extracted images that make up our actual dataset.

Assessment for data quality

Identifying issues in all of the columns

#Checking every column to identify any issues

  check(image_details)
## $image
## $image$identifyMissing
## No problems found.
## $image$identifyWhitespace
## No problems found.
## $image$identifyLoners
## Note that the following levels have at most five observations: frame407.jpg, frame408.jpg, frame409.jpg, frame410.jpg, frame411.jpg, ..., frame463.jpg, frame464.jpg, frame465.jpg, frame466.jpg, frame467.jpg (51 values omitted).
## $image$identifyCaseIssues
## No problems found.
## $image$identifyNums
## No problems found.
## 
## $Source
## $Source$identifyMissing
## No problems found.
## $Source$identifyWhitespace
## No problems found.
## $Source$identifyLoners
## No problems found.
## $Source$identifyCaseIssues
## No problems found.
## $Source$identifyNums
## No problems found.
## 
## $Class
## $Class$identifyMissing
## No problems found.
## $Class$identifyWhitespace
## No problems found.
## $Class$identifyLoners
## No problems found.
## $Class$identifyCaseIssues
## No problems found.
## $Class$identifyNums
## No problems found.
## 
## $Probe
## $Probe$identifyMissing
## No problems found.
## $Probe$identifyWhitespace
## No problems found.
## $Probe$identifyLoners
## No problems found.
## $Probe$identifyCaseIssues
## No problems found.
## $Probe$identifyNums
## No problems found.
## 
## $width
## $width$identifyMissing
## No problems found.
## $width$identifyOutliers
## Note that the following possible outlier values were detected: 928, 960, 962, 1068, 1276, 1280, 1920.
## 
## $height
## $height$identifyMissing
## No problems found.
## $height$identifyOutliers
## Note that the following possible outlier values were detected: 197.

-> The above output shows that there are no missing values in any of the records; we have the source, class, probe and dimensions of every image. This matters because we need this information to understand the proportion of images in each class, the sources we have been able to extract images from, and the dimension irregularities among images in the dataset.

-> Regarding the dimensions, some images have width values identified as outliers, i.e. their width is exceptionally low or high compared with most images in our dataset.

-> The same is true of height, but there is only one identified outlier: an image whose height is far smaller than the rest.

-> This is a useful observation; we’ll have to make sure all images used for model creation and training have the same dimensions. This will be taken care of during data pre-processing.
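To make that pre-processing step concrete, here is a minimal sketch, on toy metadata (assumed values, not the real `image_details`), of one way to pick a single target resolution: take the most frequent width-by-height pair and resize everything else to it.

```r
# Sketch on toy metadata: pick the most frequent width x height
# pair as the common target size for resizing.
meta <- data.frame(
  width  = c(480, 480, 480, 1920, 640),
  height = c(360, 360, 360, 1080, 480)
)
res    <- paste(meta$width, meta$height, sep = "x")
target <- names(sort(table(res), decreasing = TRUE))[1]
target
# "480x360" for this toy data; every image would then be resized to it
```
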

Conducting validation tests for the image metadata

The following validation tests are conducted on the metadata:

-> Does the Class variable contain only covid, normal, pneumonia and other as the possible categories?

-> Does the Probe variable contain only the two possible values, linear and convex?

-> Do the width values lie between 150 and 1,500?

-> Do the height values lie between 140 and 1,000?

  agent <- 
  create_agent(
    tbl = image_details,
    tbl_name = "Image metadata",
    label = "Validation test"
  ) %>%
  col_vals_in_set(vars(Class), set = c("covid", "normal", "pneumonia","other")) %>%
  col_vals_in_set(vars(Probe), set = c("linear","convex")) %>%
  col_vals_between(vars(width), left = 150, right = 1500) %>%
  col_vals_between(vars(height), left = 140, right = 1000) %>%
  interrogate()

The interrogation completes, but it’d be helpful to see the agent report.

agent
## Warning: The `fmt_missing()` function is deprecated and will soon be removed
## * Use the `sub_missing()` function instead

Pointblank Validation
Validation test

tibble: Image metadata

STEP  FUNCTION            COLUMN  VALUES                           UNITS  PASS  FAIL
1     col_vals_in_set()   Class   covid, normal, pneumonia, other  19K    19K   0
2     col_vals_in_set()   Probe   linear, convex                   19K    19K   0
3     col_vals_between()  width   [150, 1,500]                     19K    18K   1K
4     col_vals_between()  width   [140, 1,000]                     19K    15K   4K

Interrogation completed 2022-05-31 23:16:06 EDT (< 1 s).

The report shows that the data passes the first two validation steps but fails the last two.

 get_data_extracts(agent)
## $`3`
## # A tibble: 1,066 × 6
##    image      Source Class     Probe  width height
##    <chr>      <chr>  <chr>     <chr>  <dbl>  <dbl>
##  1 frame0.jpg core   pneumonia convex  1920   1080
##  2 frame1.jpg core   pneumonia convex  1920   1080
##  3 frame2.jpg core   pneumonia convex  1920   1080
##  4 frame3.jpg core   pneumonia convex  1920   1080
##  5 frame4.jpg core   pneumonia convex  1920   1080
##  6 frame5.jpg core   pneumonia convex  1920   1080
##  7 frame6.jpg core   pneumonia convex  1920   1080
##  8 frame7.jpg core   pneumonia convex  1920   1080
##  9 frame8.jpg core   pneumonia convex  1920   1080
## 10 frame9.jpg core   pneumonia convex  1920   1080
## # … with 1,056 more rows
## 
## $`4`
## # A tibble: 3,584 × 6
##    image      Source Class     Probe  width height
##    <chr>      <chr>  <chr>     <chr>  <dbl>  <dbl>
##  1 frame0.jpg core   pneumonia convex  1920   1080
##  2 frame1.jpg core   pneumonia convex  1920   1080
##  3 frame2.jpg core   pneumonia convex  1920   1080
##  4 frame3.jpg core   pneumonia convex  1920   1080
##  5 frame4.jpg core   pneumonia convex  1920   1080
##  6 frame5.jpg core   pneumonia convex  1920   1080
##  7 frame6.jpg core   pneumonia convex  1920   1080
##  8 frame7.jpg core   pneumonia convex  1920   1080
##  9 frame8.jpg core   pneumonia convex  1920   1080
## 10 frame9.jpg core   pneumonia convex  1920   1080
## # … with 3,574 more rows

The output above lists all the image records whose dimensions fall outside the ranges we specified. Let’s have a look at the dataframe summary to confirm this.

Table summary

 #Checking summary of the entire image metadata
  summary(image_details)
##     image              Source             Class              Probe          
##  Length:18628       Length:18628       Length:18628       Length:18628      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      width            height    
##  Min.   : 198.0   Min.   : 197  
##  1st Qu.: 600.0   1st Qu.: 409  
##  Median : 792.0   Median : 540  
##  Mean   : 811.2   Mean   : 589  
##  3rd Qu.: 816.0   3rd Qu.: 720  
##  Max.   :1920.0   Max.   :1350

-> Class and Probe are character values, as we’d have anticipated. Width and height are numerical image dimensions, and we can see their distribution across the quartiles. Notably, the maximum width among the extracted images is over 1,900 and the maximum height is over 1,300; both are outside the ranges we checked in the validation test.

Assessment for data fitness

   image_details %>% summarize(n())
## # A tibble: 1 × 1
##   `n()`
##   <int>
## 1 18628

The image dataset has 18,628 images, which is a good number to get started with. However, the project objective is a highly accurate COVID-19 prediction model, and an important requirement for satisfying the success criteria is that the model be trained on a much bigger set of images than we have now. The more the model gets to learn from, the better. So we’ll work on extracting more images; the dataset right now is not the best possible fit for our objective.

 image_details %>% group_by(Source) %>% summarize(n())
## # A tibble: 8 × 2
##   Source     `n()`
##   <chr>      <int>
## 1 clarius      652
## 2 core        3098
## 3 grepmed     3994
## 4 litfl       2371
## 5 paper       3774
## 6 pocusatlas  1745
## 7 radio        781
## 8 uf          2213

The above output shows the number of images we have been able to extract from each source. There were 9 different data sources from which we extracted lung ultrasound videos, but the output shows we have not been able to get any images from ‘Butterfly Network’. The 35 videos from this source would have added a good number of images to the dataset. Moreover, Clarius and Radiopaedia contribute far fewer images than the other sources. In order to have a dataset that works well for the modelling part, we must focus on extracting more images from these sources.

 image_details %>% group_by(Class) %>% summarize(n())
## # A tibble: 4 × 2
##   Class     `n()`
##   <chr>     <int>
## 1 covid      4003
## 2 normal     2201
## 3 other      7975
## 4 pneumonia  4449

The dataset contains lung ultrasound images, each belonging to one of four classes: COVID-positive, normal, pneumonia and other. The output above shows the number of images per class.

 g <- image_details %>%
  group_by(Class) %>%
  summarise(cnt = n()) %>%
  mutate(freq = (cnt / sum(cnt))*100) %>% 
  arrange(desc(freq))

g
## # A tibble: 4 × 3
##   Class       cnt  freq
##   <chr>     <int> <dbl>
## 1 other      7975  42.8
## 2 pneumonia  4449  23.9
## 3 covid      4003  21.5
## 4 normal     2201  11.8

The class imbalance is pretty evident from the output above. The covid class has around 4,000 images, which is a good number on its own, but in comparison to the other classes it might not be very suitable for the intended purpose. We wish to train our model to distinguish lung ultrasound images of COVID-positive patients from those of patients with no condition or other lung conditions, so it’d make sense to provide it with a good, comparable number of images from each class.

Such a class imbalance would impact the model’s ability to perform up to the desired standard; the images are not fairly distributed among the classes. It might not be possible to ensure equal proportions across all classes, but it is certainly important for the dataset to include enough ultrasound images of COVID-positive patients for the model to extract their features and learn from them.
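One common mitigation, should extra images prove hard to collect, is to weight classes inversely to their frequency during training. A minimal sketch using the class counts reported above (the weighting scheme itself is our assumption, not part of the project code):

```r
# Inverse-frequency class weights from the counts reported above.
counts  <- c(covid = 4003, normal = 2201, other = 7975, pneumonia = 4449)
weights <- sum(counts) / (length(counts) * counts)
round(weights, 2)
# covid ~1.16, normal ~2.12, other ~0.58, pneumonia ~1.05:
# the rarer the class, the more each of its images counts in the loss
```
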

Data fitness for answering research questions

(I) Which machine learning classification algorithm works the best for diagnosing Covid-19 using medical data of patients (Lung ultrasound images) evaluated using accuracy, precision, sensitivity, F1 score, ROC Curve – AUC Score, Log loss ?

As a matter of fact, the performance of a machine learning algorithm depends on the amount and type of data it is given. We have realized that our image dataset needs more images and that the data is not fairly distributed among the classes. Due to these deficiencies, the dataset might not bring forward the best performance of the algorithms. So, as of now, the dataset is not fit enough for getting to the right answer to this question.
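For reference, the metrics named in the question all derive from the confusion matrix; a small illustration with made-up counts (toy values, not model results):

```r
# Toy binary confusion-matrix counts (illustrative only).
tp <- 80; fp <- 10; fn <- 20; tn <- 90
accuracy    <- (tp + tn) / (tp + fp + fn + tn)
precision   <- tp / (tp + fp)
sensitivity <- tp / (tp + fn)            # also called recall
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
c(accuracy = accuracy, precision = round(precision, 3),
  sensitivity = sensitivity, f1 = round(f1, 3))
# accuracy 0.85, precision 0.889, sensitivity 0.8, f1 0.842
```
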

(II) Which technique helps the most in achieving the success criteria by optimizing model performance – Transfer learning, Data augmentation, Mix-up augmentation, Progressive resizing, Deep learning libraries like fastai, Hyperparameter tuning using Keras Tuner, fine tuning ?

We can identify the technique that optimizes model performance by comparing the results of each method against a baseline model. The answer can be reached, but to test the true capability of these methods, the baseline model must be trained on a dataset that best fits the problem, which is not yet the case for us.

NOTE - This week, we’ll invest some more time in image extraction, and we’ll assess the dataset again to make sure it fits the project objective before putting it to use.

Ethical Assessment for data collection and usage

Data Science Ethics Checklist

Deon badge

A. Data Collection

Since the COVID-Net Initiative has collected the videos from the sources, they must have incorporated this ethical principle. We have not retrieved any personal information while extracting the images, and the intended usage is clear as well.

The lung ultrasound videos were collected as part of an open-source effort, the COVID-Net Initiative. We have in turn scraped videos and extracted images from the 9 sources listed as part of the COVIDx-US dataset. We were concerned about bias in image extraction from the videos, but we did not have much relevant information such as the age, gender or medical history of patients. We have, however, been conscious of collecting comparable numbers of images for each class (covid, normal, pneumonia, other) to reduce the chance of class imbalance.

The dataset does not expose personally identifiable information, as it contains no such data about the data providers. The videos collected from the sources do have fields for the patient’s case, age and gender, but most of those values are missing. The image extraction process has respected the privacy and anonymization of data providers to the full extent, retrieving only the images and their medical case (covid, pneumonia, normal or other) and no personal details.

N/A

B. Data Storage

The team understands the importance of careful handling of the videos and images extracted for this project. Access will be kept limited to the 5 team members only, making sure no one else gets to access or use the data. The images will also be saved on Google Drive with access privileges restricted to the team.

The dataset does not have any personal information of individuals.

The videos and images extracted will be permanently deleted from every device involved in the project. Once the final model returns results at the desired performance standard, it is expected to work on real-world data; hence, all traces of the data used will be removed.

C. Analysis

N/A

The dataset has been checked for possible sources of bias. The video metadata has records of patients’ age and gender, but most of these are missing, making it hard to assess any possible bias on those bases. Our image dataset has 18,628 images, with the class distribution as follows: covid - 4003, normal - 2201, pneumonia - 4449, other - 7975. The numbers are clearly imbalanced. We intend to train a machine learning model on these images, and the more the model sees, the better; training it on relatively more images from one class than the others would impact its performance too.

To address the same, we have decided to spend some more time extracting more images from the data sources in order to make the numbers comparable in all classes.

The team has focused on representing the data honestly and truthfully in this assessment, as the intention is to take the project in the right direction, not to misguide it.

There is no PII to be used throughout the course of the project.

Every step of the project will be documented as thoroughly as possible, including code with descriptive comments and notes on every method used in creating and deploying the model, so that retracing is easier.

D. Modeling

Modeling is an integral part of this project work, but we have not started that aspect yet, we are trying to understand and assess the data better at this point. Once we get the model ready, before finalizing it, the following ethical assessment will be carried out and the answers will be documented.

E. Deployment

We are still at the initial stage, but answering the checklist questions to the best of our knowledge at this point.

The objective is a model that returns exceptional accuracy. The main way the model could cause harm is by giving inaccurate results, and false negatives would be more harmful than false positives: a false positive can be verified with a follow-up laboratory test, whereas a false negative can harm the patient in question. One way of updating the model to prevent future harm would be to train it on more real-world images or to tweak its performance with additional methods. Redress has not been discussed in detail yet, so we will not check it off the list for now.

If the model’s performance raises concerns or unresolvable issues, it would be taken down from its web-based implementation and its use discontinued.

Concept drift should not be a problem for the model as long as it is used for COVID prediction. If a new, undiscovered variant of COVID-19 emerges in future, the model might become ineffective. To keep track of this, the model’s performance can be tested against our static baseline model; if it seems to be deteriorating, the model would be retrained on recent ultrasound images and updated so that it keeps up with the changes.

Unintended usage has not been part of our discussion, so we should not check this off. The team will discuss the identification and prevention of unintended usage and update our answer.

Data Science Ethics Checklist generated with deon.

Exploratory data analysis

The data assessment revealed the number of images in the dataset and their distribution among the classes. We’ll now explore aspects of the dataset in more detail.

 glimpse(image_details)
## Rows: 18,628
## Columns: 6
## $ image  <chr> "frame0.jpg", "frame1.jpg", "frame2.jpg", "frame3.jpg", "frame4…
## $ Source <chr> "litfl", "litfl", "litfl", "litfl", "litfl", "litfl", "litfl", …
## $ Class  <chr> "other", "other", "other", "other", "other", "other", "other", …
## $ Probe  <chr> "convex", "convex", "convex", "convex", "convex", "convex", "co…
## $ width  <dbl> 480, 480, 480, 480, 480, 480, 480, 480, 480, 480, 480, 480, 480…
## $ height <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360…

Visualizing the distribution of height of images captured

  plot_1 <- ggplot(data = image_details)+geom_histogram(aes(x = height),fill="skyblue",col = "black") + 
       labs(x = "Dimension : Height", y= "Number of images", title = "Distribution of height of images collected") + 
  geom_vline(xintercept = mean(image_details$height), color ="red")

 ggplotly(plot_1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Clearly, the distribution of image heights is right-skewed, with some images having a height larger than most; however, this count is only a little over 240 out of 18K. The mean height of all extracted images is around 600.

Visualizing the distribution of width of images captured

  plot_2 <- ggplot(data = image_details)+geom_histogram(aes(x = width),fill="pink",col = "black") + 
       labs(x = "Dimension : Width", y= "Number of images", title = "Distribution of width of images collected")+
  geom_vline(xintercept = mean(image_details$width), color = "blue")

 ggplotly(plot_2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Quite similar to the heights, the widths of the extracted images are also right-skewed: the bulk of the distribution sits around the mean of 771, while a good number of images stand out as outliers with widths far larger than the rest.

Identifying outliers with exceptional height and width

 boxplot(image_details$height,
main = "Visualizing height for detecting exceptional values",
xlab = "Dimension : Height",
col = "skyblue",
border = "brown",
horizontal = TRUE,
notch = FALSE
)

The outliers can be clearly seen in the above box plot. This distribution is important to understand because the images have to be transformed to a uniform resolution (height and width); having an idea of what the resolutions in the dataset look like will be helpful in data pre-processing.

  1.5 * IQR(image_details$height) + summary(image_details$height)[["3rd Qu."]] 
## [1] 1186.5

All images with height beyond this value have been identified as outliers.

 boxplot(image_details$width,
main = "Visualizing width for detecting exceptional values",
xlab = "Dimension : Width",
col = "pink",
border = "brown",
horizontal = TRUE,
notch = FALSE
)

Outliers in terms of width fall on both sides: some images are much narrower than most of the dataset, while others are far wider.

  1.5 * IQR(image_details$width) + summary(image_details$width)[["3rd Qu."]] 
## [1] 1140
  summary(image_details$width)[["1st Qu."]] - 1.5 * IQR(image_details$width) 
## [1] 276

Images with a width of less than 276 or more than 1,140 have been identified as outliers.
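Applying these fences in pre-processing would look roughly like the sketch below (toy widths, not the real data; the real fences computed above were 276 and 1,140):

```r
# Drop rows whose width falls outside the 1.5 * IQR fences (toy data).
toy   <- data.frame(width = c(480, 480, 500, 520, 5000))
q     <- quantile(toy$width, c(0.25, 0.75))
fence <- c(low  = q[[1]] - 1.5 * IQR(toy$width),
           high = q[[2]] + 1.5 * IQR(toy$width))
kept  <- subset(toy, width >= fence[["low"]] & width <= fence[["high"]])
nrow(kept)
# 4: the 5000-wide outlier is dropped
```
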

Distribution of images among the 4 classes

  unique(image_details$Class)
## [1] "other"     "pneumonia" "normal"    "covid"

Covid, Pneumonia, other and normal are the four classes in the dataset.

   plot_3 <- ggplot(data = image_details) + geom_bar(aes(x = Class), fill = "pink", colour = "black") + 
  labs(x ="Image class", y = "Number of images", title ="Distribution of images as per their label")
 ggplotly(plot_3)

We sensed the class imbalance earlier in the data assessment, and it is quite evident from the plot above.

Visualizing the distribution of proportions of the different categories

  plot_4 <- ggplot(data = image_details) + geom_bar(aes(x = Class,y = ..prop.., group = 1), fill = "pink", colour = "black",stat ="count") + 
  labs(x ="Image class", y = "Proportion of images", title ="Distribution of proportion of images as per their label")
 ggplotly(plot_4)

The above plot shows the proportion of images belonging to each class. Around 42% of the dataset consists of images in the ‘other’ category, while the covid class accounts for only about half that share. These numbers must be comparable for creating a good solution.

Visualizing number of images as per the video probe

  plot_4 <- ggplot(data = image_details) + geom_bar(aes(x = Probe), fill = "pink", colour = "black") + 
  labs(x ="Image probe", y = "Number of images", title ="Distribution of images as per their probe")
 ggplotly(plot_4)

The videos available had two probe types: convex or linear. Over 14K images have been extracted from convex-probe videos, and the rest from linear-probe videos. We must try extracting more images from linear-probe videos.

Image resolution : Height and width

   plot_5 <- ggplot(data = image_details, mapping = aes(x = height, y = width)) + geom_point(col = "purple") +
  labs(x = "Dimension : Height", y ="Dimension : Width", title = "Image resolution")
 
  ggplotly(plot_5)

The plot above shows the image resolutions of the dataset at a glance. Most images have height and width under 500, but some are scattered beyond that; some combine a low width with a greater height, and vice versa.

  df =  image_details %>% group_by(Source, Class) %>% summarize(n())
## `summarise()` has grouped output by 'Source'. You can override using the `.groups`
## argument.
ggplot(data = image_details) + geom_bar(aes(x = Source, fill = as.factor(Class))) +
  ggtitle("Number of images extracted from each source by class")

The above plot gives a class-wise glimpse of the images extracted from the various sources. We can focus on the sources from which we have not been able to extract many COVID-class images; for example, no COVID-class images have been extracted from Radiopaedia, UF or LITFL. Now that we’ll work again on image extraction, we can focus on these sources.
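To target that extraction, the empty source/class combinations can also be listed programmatically; a sketch on toy data (the column names mirror our metadata, the values are made up):

```r
# Find (Source, Class) pairs with zero images (toy data).
toy <- data.frame(
  Source = c("litfl", "litfl", "radio"),
  Class  = c("other", "covid", "other")
)
tab     <- table(toy$Source, toy$Class)
missing <- which(tab == 0, arr.ind = TRUE)
rownames(tab)[missing[, "row"]]   # sources lacking a class
colnames(tab)[missing[, "col"]]   # the class they lack
# here: "radio" has no "covid" images
```
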

NOTE - We also tried to display some random images from each class for exploration purposes, but we had trouble embedding the Python code here, so we did it in a Jupyter notebook instead. We have attached another HTML file for the same.